Chronic Kidney Disease

22160 - R for Bio Data Science

Group 15

Introduction

The dataset contains 25 features related to chronic kidney disease (CKD), collected from 400 individuals in India. In addition to the CKD diagnosis, it records co-diagnoses: hypertension, diabetes, anemia, pedal edema, and coronary artery disease.


Analysis goal

Can we identify any physiological markers which are related to a chronic kidney disease diagnosis? If so, which ones?

Methods

Data cleaning and augmentation were done using the Tidyverse collection of packages.

  • Cleaning: Renaming columns and fixing variable types.

  • Augmenting: Dividing individuals into age groups, splitting and joining the data, and estimating glomerular filtration rate (GFR).
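The cleaning and augmenting steps listed above might look roughly like the sketch below. The toy data frame, the raw column names (`sc`, `htn`) and the age-group boundaries are hypothetical and only illustrate the pattern; they are not taken from the project code. The split-and-join step is not shown.

```r
library(dplyr)

# Toy stand-in for the raw data; column names and values are hypothetical.
raw <- data.frame(
  age = c(7, 34, 48, 62, 80),
  sc  = c("0.8", "1.2", "3.9", "1.0", "7.3"),  # serum creatinine read in as text
  htn = c("no", "no", "yes", "no", "yes")
)

clean <- raw |>
  # Cleaning: give columns descriptive names and fix their types
  rename(serum_creatinine = sc, hypertension = htn) |>
  mutate(
    serum_creatinine = as.numeric(serum_creatinine),
    hypertension     = factor(hypertension, levels = c("no", "yes")),
    # Augmenting: bin age into groups (boundaries are an assumption)
    age_group = cut(age, breaks = c(0, 18, 40, 60, Inf), right = FALSE,
                    labels = c("0-17", "18-39", "40-59", "60+"))
  )

clean
```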

We then ran a correlation analysis and fitted a random forest to determine which biomarkers best predict a CKD diagnosis, using the PerformanceAnalytics and randomForest packages.
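As a rough illustration of how a random forest ranks predictors, the sketch below fits `randomForest()` to simulated data and prints the variable importance. The data, the column names (`hemoglobin`, `noise`, `ckd`) and the threshold used to generate the labels are all made up for the example; the correlation step uses base `cor()` here rather than PerformanceAnalytics.

```r
library(randomForest)

set.seed(1)
n <- 300

# Simulated stand-in data: `hemoglobin` determines the class, `noise` does not.
sim <- data.frame(
  hemoglobin = rnorm(n, mean = 13, sd = 2),
  noise      = rnorm(n)
)
sim$ckd <- factor(ifelse(sim$hemoglobin < 12, "ckd", "notckd"))

# Correlation between the two numeric predictors (expected to be near zero)
cor(sim$hemoglobin, sim$noise)

# Fit the forest and extract per-variable importance
rf  <- randomForest(ckd ~ hemoglobin + noise, data = sim,
                    ntree = 200, importance = TRUE)
imp <- importance(rf)

# Rank predictors by mean decrease in Gini impurity
imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), ]
```

On real data the importance ranking should be read alongside the out-of-bag error, since a forest can rank variables even when overall prediction is poor.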

Results - Kidney disease stages

Using the CKD-EPI creatinine equation (2021) below, we estimated GFR and assigned each individual a CKD stage. Because the dataset lacks sex information, we used the average of the male and female eGFR estimates.

\[ \text{eGFR}_{\text{cr}} = 142 \times \min\left(\frac{\text{Scr}}{\kappa},\, 1\right)^{\alpha}\times \max\left(\frac{\text{Scr}}{\kappa},\, 1\right)^{-1.200}\times 0.9938^{\text{Age}}\times 1.012 \;\; \text{[if female]} \]

Here \(\text{Scr}\) is serum creatinine in mg/dL, \(\kappa\) is 0.7 for females and 0.9 for males, and \(\alpha\) is -0.241 for females and -0.302 for males.
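The equation above can be sketched in base R as follows. The constants for κ (0.7 female, 0.9 male) and α (-0.241 female, -0.302 male) are the published CKD-EPI 2021 values. The function names are hypothetical, the simple mean of the two sex-specific estimates is one reading of "average", and the KDIGO GFR categories used for staging are an assumption, since the slides do not name the staging scheme.

```r
# CKD-EPI 2021 creatinine equation; scr in mg/dL, age in years.
egfr_ckd_epi_2021 <- function(scr, age, female) {
  kappa <- ifelse(female, 0.7, 0.9)
  alpha <- ifelse(female, -0.241, -0.302)
  ratio <- scr / kappa
  142 * pmin(ratio, 1)^alpha * pmax(ratio, 1)^(-1.200) * 0.9938^age *
    ifelse(female, 1.012, 1)
}

# Sex-averaged estimate for records where sex is unknown.
egfr_sex_averaged <- function(scr, age) {
  (egfr_ckd_epi_2021(scr, age, female = TRUE) +
     egfr_ckd_epi_2021(scr, age, female = FALSE)) / 2
}

# Map eGFR (mL/min/1.73 m^2) to KDIGO GFR categories (assumed scheme).
ckd_stage <- function(egfr) {
  cut(egfr, breaks = c(-Inf, 15, 30, 45, 60, 90, Inf), right = FALSE,
      labels = c("G5", "G4", "G3b", "G3a", "G2", "G1"))
}

# Example with made-up creatinine values and ages
egfr <- egfr_sex_averaged(scr = c(0.8, 1.5, 4.0), age = c(30, 55, 70))
data.frame(egfr = round(egfr, 1), stage = ckd_stage(egfr))
```

Averaging the two estimates keeps every record usable, at the cost of a systematic error for each individual whose true sex-specific value differs from the mean.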

Results - random forest variables importance

Results - CKD and secondary diagnoses

Hypertension and diabetes were present only in those with a CKD diagnosis.

Results - diabetes hypertension analysis

tab <- data |>
  xtabs(~ hypertension + diabetes_mellitus, data = _)

tab

            diabetes_mellitus
hypertension  no yes
         no  220  31
         yes  41 106

tab |>
  chisq.test()

    Pearson's Chi-squared test with Yates' continuity correction

data:  tab
X-squared = 144.02, df = 1, p-value < 2.2e-16

V <- CramerV(tab) |>
  round(3)

sprintf("Cramér's V = %.3f", V)

[1] "Cramér's V = 0.607"
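The numbers above can be reproduced in base R from the printed contingency table, which also shows how Cramér's V relates to the chi-squared statistic. This sketch assumes `CramerV()` comes from the DescTools package (the slides do not say) and rebuilds the table by hand rather than from the original data. The reported V of 0.607 matches the statistic computed without Yates' correction; using the corrected 144.02 would give about 0.602.

```r
# Rebuild the 2x2 table from the printed counts (filled column by column)
tab <- matrix(c(220, 41, 31, 106), nrow = 2,
              dimnames = list(hypertension      = c("no", "yes"),
                              diabetes_mellitus = c("no", "yes")))

# Chi-squared statistic with and without Yates' continuity correction
x2_yates <- unname(chisq.test(tab)$statistic)
x2_plain <- unname(chisq.test(tab, correct = FALSE)$statistic)

# Cramér's V = sqrt(X^2 / (n * (min(rows, cols) - 1)))
n <- sum(tab)
v <- sqrt(x2_plain / (n * (min(dim(tab)) - 1)))

round(c(x2_yates = x2_yates, x2_plain = x2_plain, cramers_v = v), 3)
```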

Results - Key predictors of CKD diagnosis

Discussion

Findings:

  • The random forest model identified the explanatory variables that best predict CKD.
  • CKD is well predicted by albumin in urine, hemoglobin concentration, packed cell volume, red blood cell count, and creatinine levels.
  • The GFR estimate aligns with the CKD diagnosis.
  • Hypertension and diabetes are more common in those with CKD.

Discussion

Caveats and possible improvements:

  • Small dataset

  • GFR was estimated without information on sex, which reduces its accuracy

  • More information on the data source needed for more accurate conclusions

References

Data:

Chronic Kidney Disease dataset. Kaggle.com. Available: https://www.kaggle.com/datasets/mansoordaku/ckdisease/data

Packages:

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L.D., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., et al. (2019). Welcome to the Tidyverse. Journal of Open Source Software 4, 1686. https://doi.org/10.21105/joss.01686.

Peterson, B.G., Carl, P., Boudt, K., Bennett, R., Ulrich, J., Zivot, E., Cornilly, D., Hung, E., Lestel, M., Balkissoon, K., et al. (2024). PerformanceAnalytics: Econometric Tools for Performance and Risk Analysis.

Breiman, L., Cutler, A. (Fortran original), Liaw, A., and Wiener, M. (R port) (2024). randomForest: Breiman and Cutler's Random Forests for Classification and Regression.

Miscellaneous:

National Kidney Foundation. CKD-EPI Creatinine Equation (2021).

Kaufman, D.P., Basit, H., and Knohl, S.J. (2025). Physiology, Glomerular Filtration Rate. In StatPearls, (Treasure Island (FL): StatPearls Publishing).